
Conversation

@SaintBacchus
Contributor

Dynamic allocation sets the total executor count to a small number when it wants to kill some executors.
But Spark also sets the total executor count in the non-dynamic-allocation scenario.
This causes the following problem: when an executor goes down, no replacement executor will ever be brought up by Spark.

@SparkQA

SparkQA commented Jun 5, 2015

Test build #34244 has finished for PR 6662 at commit 016214d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2015

Test build #34242 has finished for PR 6662 at commit 610c390.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Can you explain how this method is called with dynamic allocation disabled?

The only call chain I can find starts with ExecutorAllocationManager, which is not instantiated when dynamic allocation is off.

Contributor

you can do sc.requestTotalExecutors

Contributor

You mean:

private[spark] override def requestTotalExecutors

I don't see any calls to it, and given it's private[spark]...

Contributor

sorry, I meant sc.requestExecutors, which eventually calls the method here.
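
For reference, a rough sketch of the call path being discussed here; the comments paraphrase the chain from the public developer API down to the backend and are an approximation, not the exact Spark source:

```scala
// Even with dynamic allocation disabled, user code can call the public
// developer API on SparkContext:
sc.requestExecutors(2)
// which, for a CoarseGrainedSchedulerBackend, roughly follows
//   SparkContext.requestExecutors
//     -> CoarseGrainedSchedulerBackend.requestExecutors
//     -> doRequestTotalExecutors(newTotal)   // the method reviewed here
```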

@andrewor14
Contributor

@SaintBacchus can you elaborate on the description a little? I'm not sure if I follow what the symptoms are and how you reproduced them.

@SaintBacchus
Contributor Author

@andrewor14 @vanzin I drew a simple call stack, like this:
[call stack diagram]

When the doRequestTotalExecutors logic runs, it resets the application's total executor count.
The problem is that if, at that moment, the number of alive executors differs from the original request, Spark will never bring the missing executors back up.
This simple scenario reproduces the issue:

  • There are 2 applications and each wants 2 executors, so 4 CPU cores are wanted in total (every executor needs one core).
  • The RM only has 3 cores, so the first application (A) gets 2 cores while the second application (B) gets only one core and waits for A to release a core.
  • Then one of A's executors is killed; B brings up its second executor and A now has to wait for resources.
  • After the timeout logic fires in A, application B finishes its job and releases its resources.
  • The expectation is that A will bring up its other executor again, but in practice that never happens.

A may be a Streaming application.
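
As a rough illustration of the bookkeeping problem above, here is a toy Scala sketch; the variable names are made up and this is not the actual Spark code:

```scala
// Toy model of the executor-count state for application A.
var targetExecutors = 2          // what A originally asked for
var aliveExecutors  = 2

// One of A's executors is killed; the kill path ends in
// doRequestTotalExecutors, which resets the total to what is alive now.
aliveExecutors -= 1
targetExecutors = aliveExecutors // the original request of 2 is forgotten

// Later, when B releases its resources, A only tries to satisfy
// targetExecutors (now 1), so the lost executor is never brought back.
```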

@SaintBacchus
Contributor Author

@andrewor14 did I describe the scenario clearly? Can you review it again?

@andrewor14
Contributor

I see. The issue is that the AM forgets about the original number of executors it wants after calling sc.killExecutor. Even if dynamic allocation is not enabled, this is still possible because of heartbeat timeouts.

I think the problem is that sc.killExecutor is used incorrectly in HeartbeatReceiver. The intention of the method is to permanently adjust the number of executors the application will get. In HeartbeatReceiver, however, this is used as a best-effort mechanism to ensure that the timed out executor is dead.
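
To make the mismatch in intent concrete, here is a small illustrative sketch (paraphrasing the behavior described above, not the actual HeartbeatReceiver code):

```scala
// Intended semantics of the API: permanently downsize the application.
sc.killExecutor("3")   // caller wants one executor fewer from now on

// How HeartbeatReceiver uses it: a best-effort "make sure the timed-out
// executor is really dead". It goes through the same code path, so it
// also lowers the target total, and the application never asks for a
// replacement executor afterwards.
```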

@andrewor14
Contributor

I have updated the description on the JIRA. However, this patch is definitely not the correct fix. The user should be able to call sc.requestExecutors (a public developer API) even when dynamic allocation is not enabled. This patch disallows that.
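
For example, something like the following should remain valid with dynamic allocation turned off (a minimal sketch; the master, app name, and configuration are only illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Dynamic allocation explicitly disabled.
val conf = new SparkConf()
  .setMaster("yarn-client")                              // illustrative
  .setAppName("manual-executor-requests")                // illustrative
  .set("spark.dynamicAllocation.enabled", "false")
val sc = new SparkContext(conf)

// Public developer API: ask for two more executors. This call should
// not be rejected just because dynamic allocation is off.
sc.requestExecutors(2)
```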

I'll submit a fix separately. In the meantime, could you close this PR? Thanks for your work @SaintBacchus.

@SaintBacchus
Contributor Author

OK
